Penggunaan Kamus Singkatan Kata Bahasa Indonesia Sehari-Hari dalam Pembangkitan Fitur Teks

Citra Lestari; Kenny Jihiro; Andreas Lim; Daniel Aprilio; Franciscus Valentinus

doi:10.32493/informatika.v8i2.29306

Authors

Citra Lestari Universitas Ciputra Surabaya
Kenny Jihiro Universitas Ciputra Surabaya
Andreas Lim Universitas Ciputra Surabaya
Daniel Aprilio Universitas Ciputra Surabaya
Franciscus Valentinus Universitas Ciputra Surabaya

DOI:

https://doi.org/10.32493/informatika.v8i2.29306

Keywords:

Indonesian informal abbreviation, dictionary, lemmatization, feature generation

Abstract

Natural Language Processing (NLP) research on the Indonesian language has progressed relatively slowly compared with research on other languages, such as English or Chinese. Most studies deal with formal Indonesian texts. NLP studies that deal with informal Indonesian texts face considerable difficulty, because informal Indonesian usually mixes formal language, everyday language, and local languages. In addition, Indonesians habitually use abbreviations when texting. These factors make feature generation difficult: machines fail to identify stopwords and to form lemmas from the bag of words. Dictionaries already exist that support lemmatization of formal Indonesian, everyday Indonesian, local languages, and even formal Indonesian abbreviations, but there is still no dictionary of informal Indonesian abbreviations. This research built an informal Indonesian abbreviation dictionary from 4000 Indonesian tweets; the dictionary contains 706 unique abbreviations. The dictionary was then used to generate features. To measure its significance, feature generation in this research used only this dictionary. Feature generation with the informal Indonesian abbreviation dictionary was tested on Indonesian tweets about the Covid-19 vaccine. The process identified 2262 abbreviations, 71.09% of which were identified as stopwords. As a further step, the generated features were tested to determine their impact on sentiment analysis, which used a Multi-Layer Perceptron. Unfortunately, the features did not improve the performance of the sentiment analysis: accuracy decreased by 3.5%, while precision, recall, and F1-score decreased by 0.02 to 0.04. It can therefore be concluded that using this dictionary alone for lemmatization is not enough; it needs to be combined with other dictionaries to achieve a better result.
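
The dictionary-based feature generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abbreviation entries, the stopword list, and the function names below are hypothetical examples chosen for illustration, and the real dictionary holds 706 entries that are not reproduced here. The sketch assumes that each token is looked up in the abbreviation dictionary, that an expanded abbreviation is dropped when its full form is a stopword, and that all other tokens pass through unchanged because no other dictionary is consulted.

```python
import re

# Hypothetical sample entries; NOT taken from the authors' 706-entry dictionary.
ABBREVIATIONS = {
    "yg": "yang",
    "dgn": "dengan",
    "utk": "untuk",
    "sdh": "sudah",
    "blm": "belum",
    "tdk": "tidak",
}

# Hypothetical stopword list used only for this illustration.
STOPWORDS = {"yang", "dengan", "untuk", "sudah"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def generate_features(text):
    """Expand known abbreviations and drop those that expand to stopwords.

    Returns (features, expanded_count, stopword_count).
    """
    features = []
    expanded_count = 0
    stopword_count = 0
    for token in tokenize(text):
        full_form = ABBREVIATIONS.get(token)
        if full_form is not None:
            expanded_count += 1
            if full_form in STOPWORDS:
                # The abbreviation stands for a stopword, so it is discarded.
                stopword_count += 1
                continue
            token = full_form
        # Tokens not in the dictionary are kept as they are.
        features.append(token)
    return features, expanded_count, stopword_count


if __name__ == "__main__":
    print(generate_features("vaksin yg sdh datang blm utk saya"))
```

With the sample data above, the call in the last line returns the features `['vaksin', 'datang', 'belum', 'saya']`, four expanded abbreviations, and three stopwords. The ratio of the last two counts corresponds to the kind of figure the abstract reports (71.09% of 2262 identified abbreviations were stopwords).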

Jurnal Informatika Universitas Pamulang, Vol. 8 No. 2, June 2023

Published

2023-06-30

Issue

Vol. 8 No. 2 (2023): JURNAL INFORMATIKA UNIVERSITAS PAMULANG

Section

Article

License

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).

Jurnal Informatika Universitas Pamulang regards CC-BY-NC or an equivalent license as the optimal license for the publication, distribution, use, and reuse of scholarly work.

In developing strategy and setting priorities, Jurnal Informatika Universitas Pamulang recognizes that free access is better than priced access, libre access is better than free access, and libre under CC-BY-NC or the equivalent is better than libre under more restrictive open licenses. We should achieve what we can when we can. We should not delay achieving free in order to achieve libre, and we should not stop with free when we can achieve libre.

Jurnal Informatika Universitas Pamulang is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

YOU ARE FREE TO:

Share: copy and redistribute the material in any medium or format
Adapt: remix, transform, and build upon the material, for non-commercial purposes as required by CC BY-NC 4.0
The licensor cannot revoke these freedoms as long as you follow the license terms.
